install.packages("compareGroups")
library(compareGroups)
Germans Trias i Pujol Research Institute and Hospital (IGTP)
Badalona, Spain
April 2, 2025
1. Statistical models
2. Simple linear regression model
3. Multiple linear regression model
4. Logistic regression model
5. Take home messages
A statistical model establishes a relationship between a dependent variable and one or more independent variables by means of an equation:
\[Y=f(X_1, \dots, X_k) + e\]
\(Y\): response, dependent, predicted variable
\(X_1, \dots, X_k\): predictors, independent, explanatory variables
\(e\): error term. Random variable with a given distribution. Part of \(Y\) that \(X_1, \dots, X_k\) are unable to explain.
Explanatory: Which variables \(X_i\) are important to explain \(Y\)? Which is the effect of \(X_i\) on \(Y\)?
Predictive: Predict \(Y\) given \(X_i\). Which value of \(Y\) can we expect for an individual with certain characteristics \(X_1, \dots, X_k\)?
Depending on the nature of the response variable \(Y\) there are different types of models:
\(Y\) continuous \(\longrightarrow\) Linear regression models
\(Y\) categorical with 2 categories \(\longrightarrow\) Logistic regression models
\(Y\) categorical with more than 2 categories \(\longrightarrow\) Multinomial regression models, ordinal regression models
\(Y\) counts \(\longrightarrow\) Poisson regression models
\(Y\) time to an event \(\longrightarrow\) Cox regression models (next session)
A linear regression model establishes a relationship between:
A quantitative response variable \(Y\)
Quantitative or qualitative predictor variables \(X_1, \dots, X_k\)
Formula specification
\[Y=\beta_{0}+\beta_{1}X_1+\beta_{2}X_2+\ldots+\beta_{k}X_k+e\]
Goal: To quantify the relationship between a dependent variable \(Y\) and an independent variable \(X\)
\(Y\): Dependent/response/outcome variable
\(X\): Independent/predictor/explanatory variable
Linear relationship:
\[Y= \beta_0 + \beta_1 X + e\]
\(\beta_0\) \(\longrightarrow\) \(Y\) expected value when \(X=0\).
\(\beta_1\) > 0 \(\longrightarrow\) When \(X\) increases, \(Y\) increases.
\(\beta_1\) < 0 \(\longrightarrow\) When \(X\) increases, \(Y\) decreases.
We can perform a hypothesis test on \(\beta_1\) to assess whether there is statistically significant evidence that \(X\) is associated with \(Y\): \(H_0: \beta_1 = 0\) (no association) versus \(H_1: \beta_1 \neq 0\).
First, install and load the {compareGroups} package into your R session
We will work with a sample dataset from this package:
This dataset contains data from 2294 individuals from the REGICOR study (Registre Gironí del COR).
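As a minimal sketch, once {compareGroups} is installed the dataset can be loaded with data(), since regicor ships with the package:

```r
library(compareGroups)

# Load the regicor example dataset bundled with the package
data(regicor)

# One row per individual (2,294 per the REGICOR sample described above)
dim(regicor)
```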
| id | year | age | sex | smoker | sbp | dbp | histhtn | txhtn | chol | hdl | triglyc | ldl | histchol | txchol | height | weight | bmi | phyact | pcs | mcs | cv | tocv | death | todeath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2265 | 2005 | 70 | Female | Never smoker | 138 | 75 | No | No | 294 | 57.00000 | 93 | 218.4000 | No | No | 160 | 64 | 25.00000 | 304.2000 | 54.455 | 58.918 | No | 1024.882 | Yes | 1299.16343 |
| 1882 | 2005 | 56 | Female | Never smoker | 139 | 89 | No | No | 220 | 50.00000 | 160 | 138.0000 | No | No | 163 | 67 | 25.21736 | 160.3000 | 58.165 | 47.995 | No | 2756.849 | No | 39.32629 |
| 3000105616 | 2000 | 37 | Male | Current or former < 1y | 132 | 82 | No | No | 245 | 59.80429 | 89 | 167.3957 | No | No | 170 | 70 | 24.22145 | 552.7912 | 43.429 | 62.585 | No | 1905.969 | No | 858.42203 |
We want to study which variables are associated with systolic blood pressure (SBP).
\(Y\): systolic blood pressure (sbp)
Let’s start with age as a predictor variable.
We want to find \(\beta_0\) and \(\beta_1\) so that \[\mbox{sbp} = \beta_0 + \beta_1\mbox{age} + e\]
But first, let’s look at the relationship between these two variables.
We can compute Pearson’s correlation coefficient to quantify the strength of their linear association.
cor.test() function computes the correlation coefficient, its 95% CI and the p-value.
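A sketch of the call (assuming the regicor data are loaded; the variable names age and sbp come from the dataset):

```r
library(compareGroups)
data(regicor)

# Pearson correlation between age and systolic blood pressure,
# with its 95% CI and the p-value for H0: rho = 0
cor.test(regicor$age, regicor$sbp)

# Spearman's rank correlation as a non-parametric alternative
cor.test(regicor$age, regicor$sbp, method = "spearman")
```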
\(\rho\)=0.489, 95%CI [0.456; 0.520] \(\longrightarrow\) Moderate positive correlation
The p-value tests the null hypothesis of no linear correlation, \(H_0: \rho = 0\), against the alternative \(H_1: \rho \neq 0\).
Pearson’s correlation coefficient assumes that X and Y follow a normal distribution
Alternative: Spearman’s correlation coefficient (argument method in cor.test())
With a correlation matrix we can also test several variables at the same time
When the correlation between two variables is very high (correlation coefficients close to 1 or -1), it is tempting to assume that this means that one variable causes the other
Correlation is about the relationship between two variables, not about which variable causes the other.
Check the website: https://www.tylervigen.com/spurious-correlations
Anscombe’s quartet
lm() function, specifying the dependent and the independent variable.
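The call can be reconstructed from the output below (the object name m1 matches its later use in the anova() and AIC() comparisons):

```r
library(compareGroups)
data(regicor)

# Simple linear regression of systolic blood pressure on age
m1 <- lm(sbp ~ age, data = regicor)
summary(m1)
```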
Call:
lm(formula = sbp ~ age, data = regicor)
Residuals:
Min 1Q Median 3Q Max
-48.642 -12.825 -1.721 11.363 93.962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 81.87772 1.87836 43.59 <2e-16 ***
age 0.90103 0.03366 26.77 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.71 on 2278 degrees of freedom
(14 observations deleted due to missingness)
Multiple R-squared: 0.2393, Adjusted R-squared: 0.239
F-statistic: 716.7 on 1 and 2278 DF, p-value: < 2.2e-16
We can visualize the table of the model using tbl_regression() from {gtsummary}
Vignette of tbl_regression(): https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html
We can set the option intercept = TRUE to obtain the \(\beta_0\):
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 82 | 78, 86 | <0.001 |
| age | 0.90 | 0.84, 0.97 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
\(\mbox{sbp} = 82 + 0.9\cdot\mbox{age}\)
\(\beta_0=82\): the expected value of systolic blood pressure when age = 0 (not meaningful here, since age = 0 lies far outside the observed range)
p-value for age \(<0.001 \longrightarrow\) Age has a statistically significant effect on SBP
\(\beta_1 > 0\): if age increases, SBP increases.
\(\beta_1 = 0.9\): the expected SBP for a subject one year older is 0.9 mmHg higher.
For a subject 5 years older, the expected increase in SBP is \(0.9 \cdot 5 = 4.5\) mmHg.
The goodness of fit of a regression model is assessed by the \(\mbox{R}^2\)
It is the proportion of the variability of \(Y\) explained by the model
It is a dimensionless value that ranges from 0 to 1
We can add the \(\mbox{R}^2\) to our results table using add_glance_table().
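A sketch of that pipeline with {gtsummary} (assuming the fitted model is named m1, as elsewhere in the document):

```r
library(compareGroups)
library(gtsummary)
data(regicor)

m1 <- lm(sbp ~ age, data = regicor)

# Regression table with the R-squared values appended below the coefficients
m1 |>
  tbl_regression(intercept = TRUE) |>
  add_glance_table(include = c(r.squared, adj.r.squared))
```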
After building the model, we have to check whether it is appropriate for our data, by means of:
Normality of the residuals (QQ-plot):
Dots should lie on the line
Deviations are allowed at the extremes
A correct pattern of the residuals (\(e = Y - \hat{Y}\)):
Residuals randomly scattered around 0
The red line goes over the 0 value
There are no “extreme” residuals
The assumption of homoscedasticity is satisfied if:
Red line approximately horizontal across the plot
There is no clear pattern among the residuals
We consider a point to be influential when the Cook’s distance is larger than 1. But always take into account the context of your data.
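The standard base-R diagnostics can be obtained directly from the fitted object (a sketch, assuming m1 <- lm(sbp ~ age, data = regicor) as before):

```r
library(compareGroups)
data(regicor)

m1 <- lm(sbp ~ age, data = regicor)

# Residuals vs fitted, normal QQ-plot, scale-location, and
# residuals vs leverage (with Cook's distance contours)
par(mfrow = c(2, 2))
plot(m1)

# Cook's distances directly, to flag influential points (> 1)
which(cooks.distance(m1) > 1)
```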
Multiple linear regression generalizes the simple linear regression model by incorporating more than one predictor variable
Useful to investigate the effect of a set of predictor variables on a response variable
Useful to study the effect of one predictor variable adjusted for a set of covariates.
Given \(k\) predictor variables \(X_1, \dots, X_k\):
Formula specification: \[Y=\beta_{0}+\beta_{1}X_1+\beta_{2}X_2+\ldots+\beta_{k}X_k+e\]
Coefficients \(\beta_i\) give us information about the effect of \(X_i\) on \(Y\)
If \(X_i\) is quantitative, \(\beta_i\) represents the change in the \(Y\) value when the variable \(X_i\) increases by one unit (when all other predictors are held constant):
If \(X_i\) is qualitative, \(\beta_i\) represents the change in the Y value with respect to the reference category of \(X_i\) (when all other predictors are held constant).
Let’s continue with the regicor dataset to analyze the effect of age (quantitative) and sex (qualitative) on systolic blood pressure
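The corresponding fit (the name m2 and the formula sbp ~ age + sex are confirmed later by the anova() output):

```r
library(compareGroups)
data(regicor)

# Multiple linear regression: age (quantitative) and sex (qualitative)
m2 <- lm(sbp ~ age + sex, data = regicor)
summary(m2)
```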
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 85 | 81, 88 | <0.001 |
| age | 0.90 | 0.83, 0.96 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | -5.4 | -6.9, -4.0 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
\[\mbox{SBP}=85+0.9\cdot\{\mbox{age}\}-5.4\cdot \{\mbox{sex=Female}\}\]
Both p-values are lower than 0.05 \(\longrightarrow\) Both age and sex are significantly associated with SBP
\(\beta_{age} = 0.90 > 0 \longrightarrow\) When age increases, SBP increases \(\longrightarrow\) For a one-year increase, SBP increases by 0.90 mmHg.
\(\beta_{female} = -5.4 < 0 \longrightarrow\) Females have lower SBP than males \(\longrightarrow\) Females are expected to have a 5.4 mmHg lower SBP than males.
Adding significant variables increases the predictive ability of the model.
If we add more variables, \(R^2\) will increase.
Adjusted \(R^2\): modified version of \(R^2\) that is adjusted for the number of predictors.
We can predict the value of SBP given the characteristics of an individual with the predict() function.
If no newdata argument is specified, the predictions are computed for the data used to fit the model.
library(dplyr)
regicor2 <- regicor |>
  filter(!is.na(sbp) & !is.na(age) & !is.na(sex)) # we can't predict with missing data
regicor2 <- regicor2 |>
  mutate(pred = predict(m2, newdata = regicor2)) |> # store predictions
  select(id, age, sex, pred) # select variables to visualize predictions
head(regicor2, 4)

          id age    sex     pred
1       2265  70 Female 142.3198
2       1882  56 Female 129.7232
3 3000105616  37   Male 118.0482
4 3000103485  69 Female 141.4200
The anova() function allows us to compare two nested models to see if they differ:

anova(m1, m2)

Analysis of Variance Table
Model 1: sbp ~ age
Model 2: sbp ~ age + sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2278 714849
2 2277 698125 1 16724 54.547 2.119e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value tests the null hypothesis that both models explain the data equally well.
p-value < 0.05 \(\longrightarrow\) Models m1 and m2 are different \(\longrightarrow\) Sex explains a part of SBP, independently of age.
Models (nested or not) can also be compared with the AIC.
Deviance is a measure of the lack of fit to the data in a regression model.
Akaike information criterion (AIC) \[\mbox{AIC} = \mbox{Deviance} + 2k,\] where k is the number of parameters in the model.
AIC penalizes for model complexity.
Smaller values indicate better fit.
The AIC for a model can be computed with AIC() function.
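A sketch of the comparison (m1 and m2 as fitted above):

```r
library(compareGroups)
data(regicor)

m1 <- lm(sbp ~ age, data = regicor)
m2 <- lm(sbp ~ age + sex, data = regicor)

# Lower AIC indicates a better trade-off between fit and complexity
AIC(m1, m2)
```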
AIC for model m2 is lower
m2 (age and sex) fits the data better than m1 (age).
m2 |>
tbl_regression(intercept=TRUE) |>
add_glance_table(
include = c(nobs, r.squared, adj.r.squared, AIC))

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 85 | 81, 88 | <0.001 |
| age | 0.90 | 0.83, 0.96 | <0.001 |
| Sex | |||
| Male | — | — | |
| Female | -5.4 | -6.9, -4.0 | <0.001 |
| No. Obs. | 2,280 | ||
| R² | 0.257 | ||
| Adjusted R² | 0.256 | ||
| AIC | 19,530 | ||
| Abbreviation: CI = Confidence Interval | |||
We can present two models side by side with the tbl_merge() function:

t1 <- m1 |>
  tbl_regression(intercept=TRUE)
t2 <- m2 |>
  tbl_regression(intercept=TRUE)
tbl_merge(
  list(t1, t2),
  tab_spanner = c("**Model with age**", "**Model with age and sex**"))

| Characteristic | Model with age | | | Model with age and sex | | |
|---|---|---|---|---|---|---|
| | Beta | 95% CI | p-value | Beta | 95% CI | p-value |
| (Intercept) | 82 | 78, 86 | <0.001 | 85 | 81, 88 | <0.001 |
| age | 0.90 | 0.84, 0.97 | <0.001 | 0.90 | 0.83, 0.96 | <0.001 |
| Sex | | | | | | |
| Male | | | | — | — | |
| Female | | | | -5.4 | -6.9, -4.0 | <0.001 |
| Abbreviation: CI = Confidence Interval | | | | | | |
In many studies, the outcome variable of interest is the presence or absence of some event:
Outcome: binary/dichotomous variable \(\longrightarrow\) Logistic regression
Goal: describe the relationship between the binary outcome and several explanatory variables.
Evaluating whether a patient responds to a treatment for hypertension.
Identifying patients at risk of ischemic stroke based on lifestyle and medical data.
Predicting whether a patient has diabetes based on glucose levels, BMI, and insulin levels.
If we modeled a binary outcome with the linear regression model \[Y=\beta_0+\beta_1 X_1+\cdots+\beta_k X_k+e\] the predictions would not be restricted to valid probabilities, so it’s not what we need.
In a logistic regression model we want to study the probability of the outcome, \(P(Y=1)=p\), a value between 0 and 1.
We can compute the odds, which is the ratio of the probability of an event occurring divided by the probability of that event not occurring
\[\frac{p}{1-p}=\frac{P(Y=1)}{1-P(Y=1)}\]
\[\ln\left(\frac{p}{1-p}\right)=\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right)\]
\[\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k\]
\(p=P(Y=1)\)
\(X_1, \ldots, X_k\): Variables of interest.
\(\beta_1, \ldots, \beta_k\): Regression coefficients of each variable.
\(\beta_1, \ldots, \beta_k\): indicate the magnitude of the effect of the corresponding variable.
For \(i=1,\ldots,k\), the \(p\)-values associated with these estimates correspond to the test \(H_0: \beta_i = 0\) versus \(H_1: \beta_i \neq 0\),
and they indicate whether the \(i\)-th variable has a statistically significant effect on the odds of \(Y=1\).
The odds ratio (OR) for the \(i\)-th variable is \(\exp(\beta_i) = e^{\beta_i}\)
Quantitative variables: Change in the odds of the event occurring for each unit increase in the variable.
Qualitative variables: Change in the odds of the event occurring for each category of the variable with respect to the reference category.
OR interpretation
\(\mbox{Odds ratio}>1 \rightarrow\) The odds of the event occurring increases.
\(\mbox{Odds ratio}<1 \rightarrow\) The odds of the event occurring decreases.
\(\mbox{Odds ratio}=1 \rightarrow\) The odds of the event occurring remains the same.
Going back to the regicor database, we want to study if bmi has an effect on death risk.
We want to fit a logistic model
\[\ln\left(\frac{P(\mbox{death})}{1- P(\mbox{death})}\right)= \beta_0 + \beta_1 \cdot \mbox{bmi}\]
Logistic model (quantitative variable)
glm() function, specifying the outcome, the adjustment variables, the data and family binomial.
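The call can be reconstructed from the output below (reusing the name m1, as the later predict() call does):

```r
library(compareGroups)
data(regicor)

# Logistic regression of death on bmi; family = binomial gives the logit link
m1 <- glm(death ~ bmi, family = binomial, data = regicor)
summary(m1)

# Odds ratio for a 1-unit increase in bmi
exp(coef(m1)[["bmi"]])
```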
Call:
glm(formula = death ~ bmi, family = binomial, data = regicor)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.83607 0.48094 -12.135 < 2e-16 ***
bmi 0.11882 0.01592 7.466 8.3e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1199.4 on 2124 degrees of freedom
Residual deviance: 1145.2 on 2123 degrees of freedom
(23 observations deleted due to missingness)
AIC: 1149.2
Number of Fisher Scoring iterations: 5
We can visualise the model results in a fancy table:
Using tbl_regression() function from {gtsummary} package
Specifying exponentiate = TRUE to show odds ratios instead of log-odds.
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| bmi | 1.13 | 1.09, 1.16 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
OR > 1 \(\longrightarrow\) If bmi increases, the odds of dying increase.
OR = 1.13 \(\longrightarrow\) For a 1-unit increase in the bmi, the odds of dying are multiplied by 1.13 \(\longrightarrow\) For a 1-unit increase in the bmi, the odds of dying are increased by 13%.
OR for a 5-units increase: \[e^{5 \cdot \beta_{bmi}} = (e^{\beta_{bmi}})^5=(\mbox{OR}_{bmi})^5=1.13^5=1.84\]
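The same computation in R, using the rounded per-unit OR from the table above:

```r
# OR for a 1-unit increase in bmi (rounded, from the table above)
or_1unit <- 1.13

# Exponentials multiply, so the OR for a 5-unit increase
# is the per-unit OR raised to the 5th power
or_5units <- or_1unit^5
round(or_5units, 2)  # 1.84
```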
We can predict the probability of death for an individual given their bmi with the predict() function using type = "response".
If no newdata argument is specified, the probabilities are computed for the data used to fit the model.
regicor2 <- regicor |>
  filter(!is.na(bmi)) # we can't predict with missing data
regicor2 <- regicor2 |>
  mutate(pred = predict(m1, newdata = regicor2, type = "response")) |> # store predictions
  select(id, bmi, death, pred)
head(regicor2)

          id      bmi death       pred
1       2265 25.00000   Yes 0.05388009
2       1882 25.21736    No 0.05521190
3 3000105616 24.22145    No 0.04935442
4 3000103485 31.46837    No 0.10938419
5 3000100883 17.42509    No 0.02262871
6 3000108476 29.44676   Yes 0.08808444
We want to study the association of smoking and death.
Smoking has three categories: Never smoker, Current or former < 1y and Former >= 1y. Let’s see how the categories are ordered:
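One way to check the category order (the first level is taken as the reference in the model):

```r
library(compareGroups)
data(regicor)

# The first level of the factor is the reference category
levels(regicor$smoker)
```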
With Never smoker as the reference category, we are going to build the model
\[\ln\left(\frac{P(\mbox{death})}{1- P(\mbox{death})}\right)= \beta_0 + \beta_1 \cdot \mbox{\{Former >= 1y\}} + \beta_2 \cdot \mbox{\{Current or former < 1y\}}\]
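The call can be reconstructed from the output below (the object name fit_smoker is illustrative; the document does not name this model):

```r
library(compareGroups)
data(regicor)

# Logistic regression of death on smoking status (reference: Never smoker)
fit_smoker <- glm(death ~ smoker, family = binomial, data = regicor)
summary(fit_smoker)
```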
Call:
glm(formula = death ~ smoker, family = binomial, data = regicor)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.5041 0.1122 -22.327 < 2e-16 ***
smokerFormer >= 1y -0.5895 0.2658 -2.218 0.02656 *
smokerCurrent or former < 1y 0.5479 0.1706 3.211 0.00132 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1196.8 on 2109 degrees of freedom
Residual deviance: 1175.0 on 2107 degrees of freedom
(38 observations deleted due to missingness)
AIC: 1181
Number of Fisher Scoring iterations: 5
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| smoker | |||
| Never smoker | — | — | |
| Former >= 1y | 0.55 | 0.32, 0.91 | 0.027 |
| Current or former < 1y | 1.73 | 1.24, 2.41 | 0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
For Former >= 1y, OR < 1 \(\longrightarrow\) Former smokers have lower odds of dying than never smokers.
For Current or former < 1y, OR > 1 \(\longrightarrow\) Current or recent former smokers have higher odds of dying than never smokers.
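The adjusted model behind the next table (the name m3 matches its later use in predict()):

```r
library(compareGroups)
data(regicor)

# Smoking status adjusted for bmi
m3 <- glm(death ~ smoker + bmi, family = binomial, data = regicor)
summary(m3)

# Odds ratios for each coefficient
exp(coef(m3))
```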
If we additionally adjust for bmi (model death ~ smoker + bmi), we obtain:

| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| smoker | |||
| Never smoker | — | — | |
| Former >= 1y | 0.57 | 0.33, 0.94 | 0.035 |
| Current or former < 1y | 2.09 | 1.47, 2.95 | <0.001 |
| bmi | 1.14 | 1.10, 1.17 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
regicor3 <- regicor |>
  filter(!is.na(smoker) & !is.na(bmi)) # we can't predict with missing data
regicor3 <- regicor3 |>
  mutate(pred = predict(m3, newdata = regicor3, type = "response")) |> # store predictions
  select(id, bmi, death, pred, smoker)
head(regicor3)

          id      bmi death       pred                 smoker
1       2265 25.00000   Yes 0.04586578           Never smoker
2       1882 25.21736    No 0.04710176           Never smoker
3 3000105616 24.22145    No 0.08324754 Current or former < 1y
4 3000103485 31.46837    No 0.09928560           Never smoker
5 3000100883 17.42509    No 0.03657965 Current or former < 1y
6 3000108476 29.44676   Yes 0.07838014           Never smoker
We can visualise the predicted probabilities with the ggplot() function from the {ggplot2} package.
Statistical models are a useful way to analyze the relationship between an outcome variable and several predictor variables.
Adjust for confounders when evaluating intervention effects.
Use DAGs when assessing causal relationships.
Exploratory analyses allow more flexibility, but model assumptions must always be checked.
Continuous variable \(\longrightarrow\) Linear regression
Dichotomous variable \(\longrightarrow\) Logistic regression
Internally and externally validate predictive models.
Always check the residuals and the goodness of fit of models
Applied Biostatistics Course with R